Cardiovascular Diseases Report¶
Introduction¶
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year [1]. CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions [1]. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age [2].
The dataset was taken from Kaggle and records patient characteristics associated with heart attacks, along with factors that contribute to them.
In this project, we use this dataset, which captures key factors such as gender, age, blood glucose, and blood pressure of participants, to develop a classification model that predicts the likelihood of a new patient experiencing a heart attack.
The model predicts the presence of a heart attack from a small set of main predictors. The question we address is: Is a new patient likely to have a heart attack, based on age, troponin and kcm?
Methods¶
Since the variable "class", which indicates the presence of a heart attack, is categorical, we analyze the data by classification, specifically with the K-nearest neighbors (KNN) classification algorithm.
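As a rough illustration of what K-nearest neighbors does, the sketch below classifies one new observation by standardizing the predictors, measuring Euclidean distance to every training row, and taking a majority vote among the k closest rows. It is written in base R for clarity and is not the implementation used in this report; the function name `knn_predict` and the toy values are made up for the example.

```r
# Minimal KNN classifier in base R (illustrative sketch only).
# train_x: numeric matrix of predictors, train_y: factor of labels,
# new_x: numeric vector for one new observation, k: number of neighbours.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Standardize with the training mean and standard deviation so that
  # predictors on large scales do not dominate the distance
  centers <- colMeans(train_x)
  scales <- apply(train_x, 2, sd)
  train_scaled <- scale(train_x, center = centers, scale = scales)
  new_scaled <- (new_x - centers) / scales

  # Euclidean distance from the new observation to every training row
  distances <- sqrt(rowSums(sweep(train_scaled, 2, new_scaled)^2))

  # Majority vote among the k closest training observations
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up toy values, not taken from the dataset
toy_x <- cbind(
  age = c(30, 35, 40, 60, 65, 70),
  troponin = c(0.005, 0.010, 0.008, 1.20, 2.50, 0.90)
)
toy_y <- factor(c("negative", "negative", "negative",
                  "positive", "positive", "positive"))

prediction <- knn_predict(toy_x, toy_y,
                          new_x = c(age = 62, troponin = 1.5), k = 3)
prediction
```

Standardizing before computing distances matters here because age and troponin are measured on very different scales.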
To visualize our results and predictions, we use scatter plots, which help show which factors are associated with the presence of a heart attack. We then select only the columns we intend to use for prediction.
The columns we will be using are as follows:
- age: age of the patients
- kcm: amount of the enzyme CK-MB (renamed to enzyme_amount)
- troponin: troponin test result, in ng/L [4]
- class: diagnosis type - negative refers to the absence of a heart attack, while positive refers to the presence of a heart attack.
We run a KNN classification analysis for three models: the first (age, troponin) and the second (age, enzyme_amount) each use two predictors, while the third (age, troponin and enzyme_amount) uses all three.
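The column selection and the three predictor sets described above could be set up along the following lines. This is only a sketch under stated assumptions: the object names (`toy_data`, `heart_data`, `recipe_1` to `recipe_3`) are hypothetical, the four rows are copied from the data preview shown in this report, and the scaling and centering steps are an assumption based on KNN being distance-based rather than something this section specifies.

```r
library(tidyverse)
library(tidymodels)

# A few rows copied from the data preview, standing in for the full dataset
toy_data <- tibble(
  age = c(64, 21, 55, 45),
  gender = c(1, 1, 1, 1),
  kcm = c(1.80, 6.75, 1.99, 1.24),
  troponin = c(0.012, 1.060, 0.003, 4.250),
  class = c("negative", "positive", "negative", "positive")
)

# Keep only the columns of interest, rename kcm, and make class a factor
heart_data <- toy_data |>
  select(age, kcm, troponin, class) |>
  rename(enzyme_amount = kcm) |>
  mutate(class = as_factor(class))

# One recipe per candidate model (assumed preprocessing: scale and center)
recipe_1 <- recipe(class ~ age + troponin, data = heart_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

recipe_2 <- recipe(class ~ age + enzyme_amount, data = heart_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

recipe_3 <- recipe(class ~ age + troponin + enzyme_amount, data = heart_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
```

Each recipe can then be paired with the same KNN model specification in a workflow, so that the three models differ only in their predictors.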
Preliminary exploratory data analysis¶
To carry out the analysis in R within a Jupyter notebook, we first load the packages that provide the functions we need.
# Install required packages if not already installed
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
if (!requireNamespace("ISLR", quietly = TRUE)) install.packages("ISLR")
# Load the packages
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
# Set options
options(repr.matrix.max.rows = 6)
Registered S3 method overwritten by 'GGally':
  method from
  +.gg   ggplot2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ rsample      1.2.0
✔ dials        1.2.0     ✔ tune         1.1.2
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.1     ✔ yardstick    1.2.0
✔ recipes      1.0.8
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
Loading data from the web¶
To read the dataset from the web, we manually uploaded it to GitHub, then used the read_csv function to load it into the notebook and store it as a data frame.
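# Raw CSV file hosted on GitHub
url <- "https://raw.githubusercontent.com/l-glucose/dsci100/main/data/heart_attack.csv"
# Read the CSV into a data frame, suppressing the column-type message
raw_data <- read_csv(url, show_col_types = FALSE)
raw_data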
| age | gender | impluse | pressurehight | pressurelow | glucose | kcm | troponin | class |
|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
| 64 | 1 | 66 | 160 | 83 | 160 | 1.80 | 0.012 | negative |
| 21 | 1 | 94 | 98 | 46 | 296 | 6.75 | 1.060 | positive |
| 55 | 1 | 64 | 160 | 77 | 270 | 1.99 | 0.003 | negative |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 45 | 1 | 85 | 168 | 104 | 96 | 1.24 | 4.250 | positive |
| 54 | 1 | 58 | 117 | 68 | 443 | 5.80 | 0.359 | positive |
| 51 | 1 | 94 | 157 | 79 | 134 | 50.89 | 1.770 | positive |
First, using the ggpairs function from the GGally package, we create a pair plot (also called a "scatter plot matrix") of all the columns in the dataset to examine the relationship between the response variable class and the other variables. We then use it to choose suitable variables as predictors.
# Enlarge the plotting area for the pair plot
options(repr.plot.height = 10, repr.plot.width = 10)

# Scatter plots below the diagonal, bar charts on the diagonal
pairplot <- raw_data |>
  ggpairs(
    lower = list(continuous = wrap("points", alpha = 0.4)),
    diag = list(continuous = "barDiag")
  ) +
  theme(text = element_text(size = 10))
pairplot

# Restore a smaller plot size for later figures
options(repr.plot.height = 7, repr.plot.width = 8)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.